14 research outputs found

    XARK: an extensible framework for automatic recognition of computational kernels

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in ACM Transactions on Programming Languages and Systems. The final authenticated version is available online at: http://dx.doi.org/10.1145/1391956.1391959[Abstract] The recognition of program constructs that are frequently used by software developers is a powerful mechanism for optimizing and parallelizing compilers to improve the performance of the object code. The development of techniques for automatic recognition of computational kernels such as inductions, reductions and array recurrences has been an intensive research area in the scope of compiler technology during the 90's. This article presents a new compiler framework that, unlike previous techniques that focus on specific and isolated kernels, recognizes a comprehensive collection of computational kernels that appear frequently in full-scale real applications. The XARK compiler operates on top of the Gated Single Assignment (GSA) form of a high-level intermediate representation (IR) of the source code. Recognition is carried out through a demand-driven analysis of this high-level IR at two different levels. First, the dependences between the statements that compose the strongly connected components (SCCs) of the data-dependence graph of the GSA form are analyzed. As a result of this intra-SCC analysis, the computational kernels corresponding to the execution of the statements of the SCCs are recognized. Second, the dependences between statements of different SCCs are examined in order to recognize more complex kernels that result from combining simpler kernels in the same code. Overall, the XARK compiler builds a hierarchical representation of the source code as kernels and dependence relationships between those kernels. This article describes in detail the collection of computational kernels recognized by the XARK compiler. Besides, the internals of the recognition algorithms are presented. The design of the algorithms enables to extend the recognition capabilities of XARK to cope with new kernels, and provides an advanced symbolic analysis framework to run other compiler techniques on demand. Finally, extensive experiments showing the effectiveness of XARK for a collection of benchmarks from different application domains are presented. In particular, the SparsKit-II library for the manipulation of sparse matrices, the Perfect benchmarks, the SPEC CPU2000 collection and the PLTMG package for solving elliptic partial differential equations are analyzed in detail.Ministeiro de Educación y Ciencia; TIN2004-07797-C02Ministeiro de Educación y Ciencia; TIN2007-67537-C03Xunta de Galicia; PGIDIT05PXIC10504PNXunta de Galicia; PGIDIT06PXIB105228P

    Compiler support for parallel code generation through kernel recognition

    Get PDF
    [Abstract] Summary form only given. The automatic parallelization of loops that contain complex computations is still a challenge for current parallelizing compilers. The main limitations are related to the analysis of expressions that contain subscripted subscripts, and the analysis of conditional statements that introduce complex control flows at run-time. We use the term complex loop to designate loops with such characteristics. We describe the parallelization of sequential complex loop nests using a generic compiler framework (proposed in an earlier paper [Arenaz et al., ICS'2003] ) that accomplishes kernel recognition through the analysis of the gated single assignment program representation. Specifically, we focus on an extension of this framework that enables its use as a powerful tool for gathering source code information that is relevant for the parallelization of each computational kernel. A set of example codes are analyzed in detail to illustrate the potential of our approach. Experimental results using a benchmark suite of complex loop nests are also presented

    A Grid Portal for an Undergraduate Parallel Programming Course

    Get PDF
    [Abstract] This paper describes an experience of designing and implementing a portal to support transparent remote access to supercomputing facilities to students enrolled in an undergraduate parallel programming course. As these facilities are heterogeneous, are located at different sites, and belong to different institutions, grid computing technologies have been used to overcome these issues. The result is a grid portal based on a modular and easily extensible software architecture that provides a uniform and user-friendly interface for students to work on their programming laboratory assignments.Universidade da Coruña; UDC-TIC03-057Xunta de Galicia; PGIDIT02TIC00103CTXunta de Galicia; PGIDIT04TIC105004PREuropean Commission; IST-2001-3224

    A Novel Compiler Support for Automatic Parallelization on Multicore Systems

    Get PDF
    [Abstract] The widespread use of multicore processors is not a consequence of significant advances in parallel programming. In contrast, multicore processors arise due to the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge due to its need for complex program analysis and the existence of unknowns during compilation. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of domain-independent kernel (e.g., assignment, reduction, recurrence). Such kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments that evaluate the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000.[Resumen] El uso generalizado de procesadores multinúcleo no es consecuencia de avances significativos en programación paralela. Por el contrario, los procesadores multinúcleo surgen debido a la complejidad de construir chips mononúcleo que sean eficiente energéticamente y tengan altas velocidades de reloj. La paralelización automática de aplicaciones secuenciales es la solución ideal para hacer la programación paralela tan fácil como escribir programas para ordenadores secuenciales. Sin embargo, la paralelización automática continua a ser un gran reto debido a su necesidad de complejos análisis del programa y la existencia de incógnitas durante la compilación. Este artículo propone un nuevo método para convertir una aplicación secuencial en su contrapartida paralela que pueda ser ejecutada en los procesadores multinúcleo actuales. Este método depende de una representación intermedia basada en el concepto de núcleos independientes del dominio (p. ej., asignación, reducción, recurrencia). Esta visión centrada en núcleos oculta la complejidad de los detalles de implementación, permitiendo la construcción de la versión paralela incluso cuando el código fuente de la aplicación secuencial contiene diferentes variantes de las computaciones (p. ej., punteros, arrays, flujos de control complejos). Se presentan experimentos que evalúan la efectividad y el rendimiento de nuestra aproximación con respecto al estado del arte. La serie programas de prueba consiste en códigos sintéticos que representan núcleos independientes del dominio comunes, rutinas de álgebra lineal densa/dispersa y de procesamiento de imagen, y aplicaciones completas del SPEC CPU2000.[Resumo] O uso xeralizado de procesadores multinúcleo non é consecuencia de avances significativos en programación paralela. Pola contra, os procesadores multinúcleo xurden debido á complexidade de construir chips mononúcleo que sexan eficientes enerxéticamente e teñan altas velocidades de reloxo. A paralelización automática de aplicacións secuenciais é a solución ideal para facer a programación paralela tan sinxela como escribir programas para ordenadores secuenciais. Sen embargo, a paralelización automática continua a ser un gran reto debido a súa necesidade de complexas análises do programa e a existencia de incógnitas durante a compilación. Este artigo propón un novo método para convertir unha aplicación secuencias na súa contrapartida paralela que poida ser executada nos procesadores multinúcleo actuais. Este método depende dunha representación intermedia baseada no concepto dos núcleos independentes do dominio (p. ex., asignación, reducción, recurrencia). Esta visión centrada en núcleos oculta a complexidade dos detalles de implementación, permitindo a construcción da versión paralela incluso cando o código fonte da aplicación secuencial contén diferentes variantes das computacións (p. ex., punteiros, arrays, fluxos de control complejo). Preséntanse experimentos que evalúan a efectividade e o rendemento da nosa aproximación con respecto ao estado da arte. A serie de programas de proba consiste en códigos sintéticos que representan núcleos independentes do dominio comunes, rutinas de álxebra lineal densa/dispersa e de procesamento de imaxe, e aplicacións completas do SPEC CPU2000.Ministerio de Economía y Competitividad; TIN2010-16735Ministerio de Educación y Cultura; AP2008-0101

    Automated and accurate cache behavior analysis for codes with irregular access patterns

    Get PDF
    This is the peer reviewed version of the following article: Andrade, D. , Arenaz, M. , Fraguela, B. B., Touriño, J. and Doallo, R. (2007), Automated and accurate cache behavior analysis for codes with irregular access patterns. Concurrency Computat.: Pract. Exper., 19: 2407-2423. doi:10.1002/cpe.1173, which has been published in final form at https://doi.org/10.1002/cpe.1173. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions.[Abstract] The memory hierarchy plays an essential role in the performance of current computers, so good analysis tools that help in predicting and understanding its behavior are required. Analytical modeling is the ideal base for such tools if its traditional limitations in accuracy and scope of application can be overcome. While there has been extensive research on the modeling of codes with regular access patterns, less attention has been paid to codes with irregular patterns due to the increased difficulty in analyzing them. Nevertheless, many important applications exhibit this kind of pattern, and their lack of locality make them more cache‐demanding, which makes their study more relevant. The focus of this paper is the automation of the Probabilistic Miss Equations (PME) model, an analytical model of the cache behavior that provides fast and accurate predictions for codes with irregular access patterns. The information requirements of the PME model are defined and its integration in the XARK compiler, a research compiler oriented to automatic kernel recognition in scientific codes, is described. We show how to exploit the powerful information‐gathering capabilities provided by this compiler to allow the automated modeling of loop‐oriented scientific codes. Experimental results that validate the correctness of the automated PME model are also presented.Ministerio de Educación y Ciencia; TIN2004-07797-C02Xunta de Galicia; PGIDIT03TIC10502PRXunta de Galicia; PGIDT05PXIC10504P

    Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in International Journal of Parallel Programming. The final authenticated version is available online at: https://doi.org/10.1007/s10766-015-0362-9[Abstract] The use of GPUs for general purpose computation has increased dramatically in the past years due to the rising demands of computing power and their tremendous computing capacity at low cost. Hence, new programming models have been developed to integrate these accelerators with high-level programming languages, giving place to heterogeneous computing systems. Unfortunately, this heterogeneity is also exposed to the programmer complicating its exploitation. This paper presents a new technique to automatically rewrite sequential programs into a parallel counterpart targeting GPU-based heterogeneous systems. The original source code is analyzed through domain-independent computational kernels, which hide the complexity of the implementation details by presenting a non-statement-based, high-level, hierarchical representation of the application. Next, a locality-aware technique based on standard compiler transformations is applied to the original code through OpenHMPP directives. Two representative case studies from scientific applications have been selected: the three-dimensional discrete convolution and the simple-precision general matrix multiplication. The effectiveness of our technique is corroborated by a performance evaluation on NVIDIA GPUs.Ministerio de Economía y Competitividad; TIN2010-16735Ministerio de Economía y Competitividad; TIN2013-42148-PGalicia, Consellería de Cultura, Educación e Ordenación Universitaria; GRC2013-055Ministerio de Educación; AP2008-0101

    Parallelization of shallow water simulations on current multi-threaded systems

    Get PDF
    Lobeiras, J., Viñas, M., Amor, M., Fraguela, B.B., Arenaz, M., García, J., Castro, M. Parallelization of shallow water simulations on current multi-threaded systems. The International Journal of High Performance Computing Applications 27, 493–512. © 2013 The Author(s), © SAGE Publications. https://doi.org/10.1177/1094342012464800[Abstract]: In this work, several parallel implementations of a numerical model of pollutant transport on a shallow water system are presented. These parallel implementations are developed in two phases. First, the sequential code is rewritten to exploit the stream programming model. And second, the streamed code is targeted for current multi-threaded systems, in particular, multi-core CPUs and modern GPUs. The performance is evaluated on a multi-core CPU using OpenMP, and on a GPU using the streaming-oriented programming language Brook+, as well as the standard language for heterogeneous systems, OpenCL.Funding This work was supported by the Galician Government (Consolidation of Competitive Research Groups, Xunta de Galicia ref. 2010/6, projects INCITE08PXIB105161PR and 08TIC001206PR), the Ministry of Science and Innovation, cofunded by the FEDER funds of the European Union (grant number TIN2010-16735, and project numbers MTM2009-11923 and MTM2010-21135).Xunta de Galicia; INCITE08PXIB105161PRXunta de Galicia; 08TIC001206P

    A multi-GPU shallow-water simulation with transport of contaminants

    Get PDF
    [Abstract] This work presents cost-effective multi-graphics processing unit (GPU) parallel implementations of a finite-volume numerical scheme for solving pollutant transport problems in bidimensional domains. The fluid is modeled by 2D shallow-water equations, whereas the transport of pollutant is modeled by a transport equation. The 2D domain is discretized using a first-order Roe finite-volume scheme. Specifically, this paper presents multi-GPU implementations of both a solution that exploits recomputation on the GPU and an optimized solution that is based on a ghost cell decoupling approach. Our multi-GPU implementations have been optimized using nonblocking communications, overlapping communications and computations and the application of ghost cell expansion to minimize communications. The fastest one reached a speedup of 78 × using four GPUs on an InfiniBand network with respect to a parallel execution on a multicore CPU with six cores and two-way hyperthreading per core. Such performance, measured using a realistic problem, enabled the calculation of solutions not only in real time but also in orders of magnitude faster than the simulated time.Copyright © 2012 John Wiley & Sons, Ltd

    Parallelware tools: an experimental evaluation on POWER systems

    Get PDF
    Static code analysis tools are designed to aid software developers to build better quality software in less time, by detecting defects early in the software development life cycle. Even the most experienced developer regularly introduces coding defects. Identifying, mitigating and resolving defects is an essential part of the software development process, but frequently defects can go undetected. One defect can lead to a minor malfunction or cause serious security and safety issues. This is magnified in the development of the complex parallel software required to exploit modern heterogeneous multicore hardware. Thus, there is an urgent need for new static code analysis tools to help in building better concurrent and parallel software. The paper reports preliminary results about the use of Appentra’s Parallelware technology to address this problem from the following three perspectives: finding concurrency issues in the code, discovering new opportunities for parallelization in the code, and generating parallel-equivalent codes that enable tasks to run faster. The paper also presents experimental results using well-known scientific codes and POWER systems.This work has been partly funded from the Spanish Ministry of Science and Technology (TIN2015-65316-P), the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya (MPEXPAR: Models de Programació i Entorns d’Execució Parallels, 2014-SGR-1051), and the European Union’s Horizon 2020 research and innovation program throughgrant agreements MAESTRO (801101) and EPEEC (801051).Peer ReviewedPostprint (author's final draft

    Efficient Parallel Numerical Solver for the Elastohydrodynamic Reynolds–Hertz Problem

    No full text
    This is a post-peer-review, pre-copyedit version of an article published in Parallel Computing: systems & applications. The final authenticated version is available online at: https://doi.org/10.1016/S0167-8191(01)00110-7[Abstract] This work presents a parallel version of a complex numerical algorithm for solving an elastohydrodynamic piezoviscous lubrication problem studied in tribology. The numerical algorithm combines regula falsi, fixed point techniques, finite elements and duality methods. The execution of the sequential program on a workstation requires significant CPU time and memory resources. Thus, in order to reduce the computational cost, we have applied parallelization techniques to the most costly parts of the original source code. Some blocks of the sequential code were also redesigned for the execution on a multicomputer. In this paper, our parallel version is described in detail, execution times that show its efficiency in terms of speedups are presented, and new numerical results that establish the convergence of the algorithm for higher imposed load values when using finer meshes are depicted. As a whole, this paper tries to illustrate the difficulties involved in parallelizing and optimizing complex numerical algorithms based on finite elements.Xunta de Galicia; XUGA 32201B97Gobierno de España; CICYT 1FD97-0118-C02Gobierno de España; D.G.E.S PB96-0341-C0
    corecore